-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[pkg/ottl/ottlfuncs] Added utf8 support to truncate_all function #36713
base: main
Are you sure you want to change the base?
[pkg/ottl/ottlfuncs] Added utf8 support to truncate_all function #36713
Conversation
Signed-off-by: Yigithan Karabulut <[email protected]>
The committers listed above are authorized under a signed CLA. |
truncatedStr := stringVal[:limit] | ||
for !utf8.ValidString(truncatedStr) { | ||
limit-- | ||
if limit == 0 { | ||
value.SetStr("") | ||
return true | ||
} | ||
truncatedStr = stringVal[:limit] | ||
} | ||
value.SetStr(truncatedStr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since utf8.ValidString
requires creating a slice and checking all of it every loop, I think a simpler and more efficient solution is to check if the byte after the slice is a valid rune start byte:
truncatedStr := stringVal[:limit] | |
for !utf8.ValidString(truncatedStr) { | |
limit-- | |
if limit == 0 { | |
value.SetStr("") | |
return true | |
} | |
truncatedStr = stringVal[:limit] | |
} | |
value.SetStr(truncatedStr) | |
for limit > 0 && !utf8.RuneStart(stringVal[limit]) { | |
limit-- | |
} | |
value.SetStr(stringVal[:limit]) |
(Neither solution works all that well if the string is not valid UTF-8 in the first place, and both will gladly cut in the middle of multi-codepoint grapheme clusters, but I'm assuming these edge cases are not much of a concern compared to outputting invalid UTF-8.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comment. The way you wrote is faster and simpler, so it can be preferred, of course, if we assume that the string is generally UTF-8 compatible. In the other way, if we consider that a UTF-8 character is a maximum of 4 bytes, we can also validate the string by going back at most 3 times. The choice is yours.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the simplest way to handle invalid UTF-8 input here would be to validate the whole string once; if it succeeds, proceed with the loop; otherwise, just cut at the byte level since it's not UTF-8 in the first place. What do you think?
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
Description
Truncate_all will slice the string up to the given length. If truncating at exactly the length results in a broken UTF-8 encoding, it'll truncate before where the last UTF-8 character started.
Link to tracking issue
Fixes #36017
Testing
Two UTF-8 characters were added to the end of the string and the limit was adjusted to match them and the truncation process was tested.
Documentation
Updated pkg/ottl/ottfuncs/README.md with description.